Readd clustering #281

Merged
merged 78 commits into from
Feb 24, 2022

SimDing
Contributor

@SimDing SimDing commented Jan 28, 2022

Concerning issue #116

This adds the clustering that was removed in #89 again, not including any user interfaces.

Clustering

The clustering is implemented in the de.jplag.clustering package.
It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options and a factory class which can be used to run the clustering in just two statements.

Algorithms

Agglomerative Clustering

Uses a bottom-up approach to successively merge similar clusters. It stops once there are no clusters left that are similar enough to merge. An implementation of this algorithm was originally included in the code base, but removed in #89. It is still included because it is a much simpler approach than the spectral clustering.
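As an illustration of the bottom-up merging described above, here is a minimal, self-contained sketch. It is not the implementation from this PR: the class and method names are made up, it works on a plain symmetric similarity matrix, and it assumes average linkage with a fixed merge threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of threshold-based agglomerative clustering (average linkage). */
class AgglomerativeSketch {

    /** Average pairwise similarity between the members of two clusters. */
    static double averageLinkage(double[][] sim, List<Integer> a, List<Integer> b) {
        double sum = 0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }

    /** Repeatedly merges the most similar pair of clusters until no pair exceeds the threshold. */
    static List<List<Integer>> cluster(double[][] sim, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single); // every submission starts as its own cluster
        }
        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = threshold; // a pair must be strictly more similar than this
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageLinkage(sim, clusters.get(a), clusters.get(b));
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (bestA < 0) {
                break; // nothing left that is similar enough to merge
            }
            // bestB > bestA, so removing bestB does not shift the index of bestA
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {0.0, 0.9, 0.1, 0.1},
            {0.9, 0.0, 0.1, 0.1},
            {0.1, 0.1, 0.0, 0.8},
            {0.1, 0.1, 0.8, 0.0}};
        System.out.println(cluster(sim, 0.2));
    }
}
```

The threshold plays the role of the stopping criterion: once the best remaining pair is not more similar than it, merging ends.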

Spectral Clustering

Spectral clustering is a clustering approach specifically for graph data. This matches the problem, since the similarities between all submissions can be thought of as a fully connected graph.
Spectral clustering works by computing the Laplacian matrix of the graph and representing the nodes as k-dimensional vectors using its eigenvectors.
At that point the resulting vectors can be clustered using a space-partitioning algorithm; I used k-Means++.
Still, both k-Means and the reduction to k dimensions require the final number of clusters k, which is not known in advance.
In addition, k-Means++ yields probabilistic results.
To find a good choice for k and a "good" clustering I employ Bayesian Optimization.
A metric I found in line with my notion of a "good" clustering is:

"The average of the clusters' modularity times their average inner cluster similarity over the number of the clusters' connections."
With modularity I mean the measure introduced by Newman, M.; Girvan, M. in "Finding and evaluating community structure in networks (2004)".
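For readers unfamiliar with the Newman and Girvan measure, the following sketch computes the modularity contribution of a single cluster, e_cc - a_c^2, on a weighted similarity graph. It only covers the modularity term, not the full metric quoted above, and the class and method names are hypothetical rather than taken from this PR.

```java
/** Hypothetical sketch: modularity contribution of one cluster in a weighted, undirected graph. */
class ModularitySketch {

    /**
     * @param w symmetric weight matrix with a zero diagonal
     * @param members indices of the nodes in the cluster
     * @return e_cc - a_c^2, where e_cc is the share of edge weight inside the cluster
     *         and a_c is the share of edge ends attached to the cluster
     */
    static double clusterModularity(double[][] w, int[] members) {
        double total = 0; // sum over all entries, i.e. twice the total edge weight
        for (double[] row : w) {
            for (double value : row) {
                total += value;
            }
        }
        double inside = 0;   // weight of edge ends that stay inside the cluster
        double attached = 0; // weight of all edge ends attached to the cluster
        for (int i : members) {
            for (int j = 0; j < w.length; j++) {
                attached += w[i][j];
            }
            for (int j : members) {
                inside += w[i][j];
            }
        }
        double e = inside / total;
        double a = attached / total;
        return e - a * a;
    }

    public static void main(String[] args) {
        double[][] w = {
            {0, 1, 0, 0},
            {1, 0, 0, 0},
            {0, 0, 0, 1},
            {0, 0, 1, 0}};
        System.out.println(clusterModularity(w, new int[] {0, 1}));
    }
}
```

Summing this value over all clusters gives the modularity of the whole partition.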

Spectral clustering is used by default.
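To make the Laplacian step of spectral clustering concrete, here is a small sketch that builds the graph Laplacian from a similarity matrix. It assumes the unnormalized form L = D - W; the description above does not say which variant the PR uses, and the class name is hypothetical. The eigen-decomposition and k-Means++ steps are left out.

```java
/** Hypothetical sketch: unnormalized graph Laplacian L = D - W of a similarity graph. */
class LaplacianSketch {

    /**
     * @param w symmetric similarity matrix; diagonal entries are ignored
     * @return matrix with node degrees on the diagonal and negated similarities elsewhere
     */
    static double[][] laplacian(double[][] w) {
        int n = w.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            double degree = 0;
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    degree += w[i][j];
                    l[i][j] = -w[i][j];
                }
            }
            l[i][i] = degree; // weighted degree of node i
        }
        return l;
    }

    public static void main(String[] args) {
        double[][] l = laplacian(new double[][] {{0, 0.5, 0.1}, {0.5, 0, 0.2}, {0.1, 0.2, 0}});
        for (double[] row : l) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Every row of such a Laplacian sums to zero, which is a quick sanity check when wiring this into a linear algebra library.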

Preprocessing

As it can be advantageous to apply some preprocessing before clustering (in particular with Spectral clustering) I included three options for preprocessing.

CDF Preprocessor

Estimates the cumulative distribution function of all similarities and multiplies each similarity with the CDF evaluated at that similarity. This has the effect of driving the lowest similarities close to zero while hardly changing the highest ones.
Since this preprocessor is non-parametric and worked well during my experiments I made it the default.
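The effect of this preprocessing can be sketched as follows. The sketch assumes the CDF is estimated as the plain empirical CDF over a flat array of pairwise similarities; the PR may estimate it differently, and the class name is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch: scale each similarity by the empirical CDF evaluated at that similarity. */
class CdfPreprocessingSketch {

    static double[] preprocess(double[] similarities) {
        double[] sorted = similarities.clone();
        Arrays.sort(sorted);
        double[] result = new double[similarities.length];
        for (int i = 0; i < similarities.length; i++) {
            // empirical CDF: share of similarities that are <= the current one
            double cdf = (double) countLessOrEqual(sorted, similarities[i]) / sorted.length;
            result[i] = similarities[i] * cdf;
        }
        return result;
    }

    /** Binary search for the number of elements <= value in a sorted array. */
    static int countLessOrEqual(double[] sorted, double value) {
        int low = 0;
        int high = sorted.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (sorted[mid] <= value) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(preprocess(new double[] {0.1, 0.2, 0.4, 0.8})));
    }
}
```

The largest similarity is multiplied by 1 and stays unchanged, while the smallest one is multiplied by 1/n.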

Threshold Preprocessor

Suppresses all similarities below a given threshold. Good values for the threshold vary greatly with the set of input submissions.

Percentile Preprocessor

Is the same as the threshold preprocessor, but the threshold is given as a percentile of the calculated similarities, making it more robust.
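A minimal sketch of both threshold-based variants, assuming that "suppressing" means setting a similarity to zero and that the percentile is given in the range 0 to 1 and resolved with a simple nearest-rank rule. These details and all names are assumptions, not taken from the PR.

```java
import java.util.Arrays;

/** Hypothetical sketch of threshold and percentile preprocessing on a flat array of similarities. */
class ThresholdPreprocessingSketch {

    /** Sets every similarity below the threshold to zero. */
    static double[] suppressBelow(double[] similarities, double threshold) {
        double[] result = new double[similarities.length];
        for (int i = 0; i < similarities.length; i++) {
            result[i] = similarities[i] < threshold ? 0.0 : similarities[i];
        }
        return result;
    }

    /** Derives the threshold from the data: the value found at the given percentile (0..1). */
    static double[] suppressBelowPercentile(double[] similarities, double percentile) {
        if (similarities.length == 0) {
            return new double[0];
        }
        double[] sorted = similarities.clone();
        Arrays.sort(sorted);
        int index = Math.min(sorted.length - 1, (int) Math.floor(percentile * sorted.length));
        return suppressBelow(similarities, sorted[index]);
    }

    public static void main(String[] args) {
        double[] similarities = {0.1, 0.2, 0.4, 0.8};
        System.out.println(Arrays.toString(suppressBelow(similarities, 0.3)));
        System.out.println(Arrays.toString(suppressBelowPercentile(similarities, 0.5)));
    }
}
```

Because the percentile variant reads its cut-off from the data, the same setting adapts to submission sets with generally high or generally low similarities.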

Options and CLI

The (many) options for clustering are all defined in a new ClusteringOptions class. This class also contains sane defaults that should allow users to run the clustering without specifying any additional CLI flags or defining them programmatically.

I refactored the CommandLineArgument enum a little, because I needed to add additional optional parameters and that many constructors became confusing. It now works in a builder-pattern-ish fashion. I also added a small class for dealing with groups of arguments (Clustering and Clustering - Preprocessing in the help text).
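The general idea of replacing many enum constructors with a builder-like helper can be sketched as below. This is an illustration only: `CliArgumentSketch`, `Spec`, and their methods are hypothetical and do not mirror the actual CommandLineArgument enum.

```java
import java.util.Optional;

/** Hypothetical sketch of an enum whose constants are described in a builder-like fashion. */
enum CliArgumentSketch {
    CLUSTER_SKIP(new Spec("--cluster-skip").help("Skips the clustering").defaultValue(false).group("Clustering")),
    CLUSTER_PP_THRESHOLD(new Spec("--cluster-pp-threshold")
            .help("Any similarity smaller than the given threshold value will be suppressed during clustering.")
            .group("Clustering - Preprocessing"));

    private final Spec spec;

    // A single constructor is enough, optional parts are set through the fluent Spec methods.
    CliArgumentSketch(Spec spec) {
        this.spec = spec;
    }

    String flag() {
        return spec.flag;
    }

    String help() {
        return spec.help;
    }

    String group() {
        return spec.group;
    }

    Optional<Object> defaultValue() {
        return Optional.ofNullable(spec.defaultValue);
    }

    /** Mutable description of one argument; every setter returns this to allow chaining. */
    static final class Spec {
        private final String flag;
        private String help = "";
        private String group = "";
        private Object defaultValue;

        Spec(String flag) {
            this.flag = flag;
        }

        Spec help(String help) {
            this.help = help;
            return this;
        }

        Spec group(String group) {
            this.group = group;
            return this;
        }

        Spec defaultValue(Object defaultValue) {
            this.defaultValue = defaultValue;
            return this;
        }
    }
}
```

Adding another optional attribute then means adding one more `Spec` method instead of another constructor overload.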

This is how the help message is now displayed:

Usage: jplag [-h] [-l {java,python3,cpp,csharp,char,text,scheme}] [-bc BC] [-v {quiet,long}] [-d] [-S S] [-p P] [-x X] [-t T] [-m M] [-n N] [-r R] [-c {normal,parallel}]
                 [--cluster-skip] [--cluster-alg {AGGLOMERATIVE,SPECTRAL}] [--cluster-metric {AVG,MIN,MAX,INTERSECTION}] [--cluster-spectral-bandwidth bandwidth]
                 [--cluster-spectral-noise noise] [--cluster-spectral-min-runs min] [--cluster-spectral-max-runs max] [--cluster-spectral-kmeans-interations iterations]
                 [--cluster-agglomerative-threshold threshold] [--cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}] [--cluster-pp-none | --cluster-pp-cdf |
                 --cluster-pp-percentile percentile | --cluster-pp-threshold threshold] rootDir [rootDir ...]

JPlag - Maintained by SDQ
Created by IPD Tichy, Guido Malpohl, and others. JPlag logo designed by Sandro Koch. Currently maintained by Sebastian Hahner and Timur Saglam.

positional arguments:
  rootDir                Root-directory that contains submissions

named arguments:
  -h, --help             show this help message and exit
  -l {java,python3,cpp,csharp,char,text,scheme}
                         Select the language to parse the submissions (default: java)
  -bc BC                 Path of the directory containing the base code (common framework used in all submissions)
  -v {quiet,long}        Verbosity of the logging (default: quiet)
  -d                     Debug parser. Non-parsable files will be stored (default: false)
  -S S                   Look in directories <root-dir>/*/<dir> for programs
  -p P                   comma-separated list of all filename suffixes that are included
  -x X                   All files named in this file will be ignored in the comparison (line-separated list)
  -t T                   Tunes the comparison sensitivity by adjusting  the  minimum  token  required  to  be  counted  as  a  matching  section. A smaller <n> increases the
                         sensitivity but might lead to more false-positives
  -m M                   Comparison similarity threshold [0-100]: All comparisons above this threshold will be saved (default: 0.0)
  -n N                   The maximum number of comparisons that will be shown in the generated report, if set to -1 all comparisons will be shown (default: 30)
  -r R                   Name of the directory in which the comparison results will be stored (default: result)
  -c {normal,parallel}   Comparison mode used to compare the programs (default: normal)

Clustering:
  --cluster-skip         Skips the clustering (default: false)
  --cluster-alg {AGGLOMERATIVE,SPECTRAL}
                         Which clustering algorithm to use. Agglomerative merges similar  submissions  bottom  up. Spectral clustering is combined with Bayesian Optimization
                         to execute the k-Means clustering algorithm multiple times, hopefully finding a "good" clustering automatically. (default: SPECTRAL)
  --cluster-metric {AVG,MIN,MAX,INTERSECTION}
                         The metric used for clustering. AVG is intersection over union, MAX can expose some attempts of obfuscation. (default: MAX)
  --cluster-spectral-bandwidth bandwidth
                         Bandwidth of the matern kernel in the Gaussian Process used  during  the  search  for  a  good number of clusters for spectral clustering. If a good
                         clustering result is found during the search, numbers of clusters  that  differ  by  something  in range of the bandwidth are also expected to good.
                         (default: 20.0)
  --cluster-spectral-noise noise
                         The result of each run in the search for good clusterings are random.  The  noise level models the variance in the "worth" of these results. It also
                         acts as a regularization constant. (default: 0.0025000002)
  --cluster-spectral-min-runs min
                         Minimum number of k-Means executions during spectral clustering. With these initial clustering sizes are explored. (default: 5)
  --cluster-spectral-max-runs max
                         Maximum number of k-Means executions during spectral  clustering.  Any  execution  after  the  initial  runs tries to balance between exploration of
                         unknown clustering sizes and exploitation of clustering sizes known as good. (default: 50)
  --cluster-spectral-kmeans-interations iterations
                         Maximum number of iterations during each execution of the k-Means algorithm. (default: 200)
  --cluster-agglomerative-threshold threshold
                         Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (default: 0.2)
  --cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
                         How to measure the similarity of two clusters during  agglomerative  clustering.  Minimum,  maximum or average similarity between the submissions in
                         each cluster. (default: AVERAGE)

Clustering - Preprocessing:
  --cluster-pp-none      Do not use any preprocessing before clustering. Not recommended for spectral clustering. (default: false)
  --cluster-pp-cdf       Before clustering, the value of the cumulative distribution function  of  all  similarities is estimated. The similarities are multiplied with these
                         estimates. This has the effect of supressing similarities that are low compared to other similarities. (default: false)
  --cluster-pp-percentile percentile
                         Any similarity smaller than the given percentile will be suppressed during clustering.
  --cluster-pp-threshold threshold
                         Any similarity smaller than the given threshold value will be suppressed during clustering.

Technical Stuff

  • This adds two new Maven dependencies: commons-math (k-Means++, vectors, matrices, and many small algorithms used here and there) and mockito (testing)
  • I tried to minimize coupling between the clustering code and other code
    • The main package de.jplag is only coupled to de.jplag.clustering through the ClusteringOptions and ClusteringFactory classes
    • The clustering package de.jplag.clustering is only coupled to de.jplag through the JPlagComparison and Submission classes
    • The clustering code only uses JPlagComparison and Submission in the ClusteringFactory and ClusteringAdapter classes, the latter replacing submissions with integer indices and comparisons with matrices.
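The decoupling idea in the last point, replacing submissions with integer indices and comparisons with a matrix, can be illustrated with a small sketch. It deliberately uses a stand-in `Comparison` type with string names instead of the real JPlagComparison and Submission classes, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turn pairwise comparisons into integer indices and a similarity matrix. */
class ClusteringAdapterSketch {

    /** Stand-in for a pairwise comparison result; not the real JPlagComparison. */
    static final class Comparison {
        final String first;
        final String second;
        final double similarity;

        Comparison(String first, String second, double similarity) {
            this.first = first;
            this.second = second;
            this.similarity = similarity;
        }
    }

    /** Assigns consecutive indices to submissions in order of first appearance. */
    static Map<String, Integer> indexSubmissions(List<Comparison> comparisons) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (Comparison comparison : comparisons) {
            indices.putIfAbsent(comparison.first, indices.size());
            indices.putIfAbsent(comparison.second, indices.size());
        }
        return indices;
    }

    /** Builds a symmetric matrix; the diagonal is left at zero (an assumption of this sketch). */
    static double[][] toMatrix(List<Comparison> comparisons, Map<String, Integer> indices) {
        double[][] matrix = new double[indices.size()][indices.size()];
        for (Comparison comparison : comparisons) {
            int i = indices.get(comparison.first);
            int j = indices.get(comparison.second);
            matrix[i][j] = comparison.similarity;
            matrix[j][i] = comparison.similarity;
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Comparison> comparisons = Arrays.asList(
                new Comparison("A", "B", 0.9), new Comparison("A", "C", 0.1), new Comparison("B", "C", 0.2));
        Map<String, Integer> indices = indexSubmissions(comparisons);
        System.out.println(indices + " " + Arrays.deepToString(toMatrix(comparisons, indices)));
    }
}
```

With such an adapter, the clustering algorithms themselves only ever see indices and matrices, which keeps them testable without any JPlag types.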

@SimDing
Contributor Author

SimDing commented Jan 28, 2022

I'm not quite finished yet, but I have a question and would like you to see my current state:

Currently I've put every setting about the clustering plainly inside the JPlagOptions class. This feels pretty messy. Do you have a suggestion?

@tsaglam
Member

tsaglam commented Jan 31, 2022

Currently I've put every setting about the clustering plainly inside the JPlagOptions class. This feels pretty messy. Do you have a suggestion?

That is a good question. Currently, I count 14 additional options you added in this PR. The question here would be how many of those the user really modifies. If some are parameters that might be tweaked in the future, but most users will not change the default values, then we should not expose these parameters as options. Even from a usability context, modifying 14 options for clustering alone via the CLI seems excessive and very unlikely (think about the flag you need to define).
Thus we can maybe reduce the number of options to the user by keeping some clustering parameters as internal parameters that may be tweaked by devs but not the users (also on options: they may be tweaked when using JPlag programmatically but not via the CLI).

From a technical standpoint, you could encapsulate the clustering parameters in a data object, but that does not solve the problem of setting these options via the CLI.
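A data object along those lines could look roughly like the sketch below. The class, field, and method names are hypothetical; only the default values are taken from the help text in the PR description.

```java
/** Hypothetical sketch of an immutable options object with defaults, created through a builder. */
final class ClusteringOptionsSketch {

    enum Algorithm { AGGLOMERATIVE, SPECTRAL }

    final Algorithm algorithm;
    final double spectralKernelBandwidth;
    final double spectralNoise;
    final int spectralMinRuns;
    final int spectralMaxRuns;
    final int spectralKMeansIterations;
    final double agglomerativeThreshold;

    private ClusteringOptionsSketch(Builder builder) {
        this.algorithm = builder.algorithm;
        this.spectralKernelBandwidth = builder.spectralKernelBandwidth;
        this.spectralNoise = builder.spectralNoise;
        this.spectralMinRuns = builder.spectralMinRuns;
        this.spectralMaxRuns = builder.spectralMaxRuns;
        this.spectralKMeansIterations = builder.spectralKMeansIterations;
        this.agglomerativeThreshold = builder.agglomerativeThreshold;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Holds the defaults; only values that differ need to be set explicitly. */
    static final class Builder {
        private Algorithm algorithm = Algorithm.SPECTRAL;
        private double spectralKernelBandwidth = 20.0;
        private double spectralNoise = 0.0025;
        private int spectralMinRuns = 5;
        private int spectralMaxRuns = 50;
        private int spectralKMeansIterations = 200;
        private double agglomerativeThreshold = 0.2;

        Builder algorithm(Algorithm algorithm) {
            this.algorithm = algorithm;
            return this;
        }

        Builder spectralMaxRuns(int runs) {
            this.spectralMaxRuns = runs;
            return this;
        }

        Builder spectralKMeansIterations(int iterations) {
            this.spectralKMeansIterations = iterations;
            return this;
        }

        Builder agglomerativeThreshold(double threshold) {
            this.agglomerativeThreshold = threshold;
            return this;
        }

        ClusteringOptionsSketch build() {
            return new ClusteringOptionsSketch(this);
        }
    }
}
```

Programmatic users would then write something like `ClusteringOptionsSketch.builder().spectralMaxRuns(10).build()`, while the CLI could expose only a subset of these setters as flags.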

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change PISE-WS21/22 Tasks for the PISE practical course (WS21/22) labels Jan 31, 2022
@tsaglam tsaglam added this to the v3.1.0 milestone Jan 31, 2022
@tsaglam tsaglam linked an issue Feb 9, 2022 that may be closed by this pull request
3 tasks
@SimDing
Contributor Author

SimDing commented Feb 10, 2022

The question here would be how many of those the user really modifies

I think the defaults are set kind of sane, so I hope most users would not have to change much.

Even from a usability context, modifying 14 options for clustering alone via the CLI seems excessive

Users would not use all options at the same time.

I see two cases in which a user would want to change the options:

  • Clustering takes too long?
    • Disable clustering
    • Use a threshold preprocessor
    • (spectral) use fewer kMeans iterations
    • (spectral) use fewer runs
  • The result is not good? The problem is that then there is not really much users can do but fiddle with the parameters that change the result of the clustering. I don't know any one of those parameters that would be best kept from users. In that case at most five parameters would be set (preprocessor, preprocessor option, noise, kernel bandwidth, and similarity metric).

A user who had both problems would use 7 options at most.

The only thing I can really remove without bad aftertaste is the option about pruning bad clusters. There does not seem to be a practical reason to look at those.

@tsaglam tsaglam self-assigned this Feb 11, 2022
Member

@dfuchss dfuchss left a comment

Minor comments :)

Comment on lines +34 to +35
float submissionSimilarity = (float) similarityMatrix.getEntry(leftSubmission, rightCluster.get(rightIndex));
similarity = (float) this.accumulator.applyAsDouble(similarity, submissionSimilarity);

Instead of casting, simply use BiFunction<Double, Double, Float> or replace float by double.
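The second suggestion, staying in double throughout, could look like the following sketch. It assumes the accumulator is a `java.util.function.DoubleBinaryOperator` (the snippet above calls `applyAsDouble`) and uses a plain array instead of the matrix type from the PR. The class and method names are hypothetical.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch: accumulate inter-cluster similarity in double, without float casts. */
class InterClusterSimilaritySketch {

    /**
     * Folds all pairwise similarities between two clusters with the given accumulator,
     * for example Math::max, Math::min, or a sum.
     */
    static double similarity(double[][] sim, int[] left, int[] right,
                             DoubleBinaryOperator accumulator, double initial) {
        double result = initial;
        for (int l : left) {
            for (int r : right) {
                result = accumulator.applyAsDouble(result, sim[l][r]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] sim = {{0.0, 0.3, 0.7}, {0.3, 0.0, 0.2}, {0.7, 0.2, 0.0}};
        System.out.println(similarity(sim, new int[] {0}, new int[] {1, 2}, Math::max, 0.0));
    }
}
```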


@tsaglam tsaglam merged commit bffc5c9 into jplag:master Feb 24, 2022
@sebinside sebinside mentioned this pull request Mar 15, 2022
@sebinside sebinside mentioned this pull request Apr 11, 2022
30 tasks
Development

Successfully merging this pull request may close these issues.

Readd clustering and min/max/avg scores
4 participants